Abstract

Descriptive video service (DVS) provides linguistic descriptions of
movies and allows visually impaired people to follow a movie along with
their peers. Such descriptions are by design mainly visual and thus
naturally form an interesting data source for computer vision and
computational linguistics. In this work we propose a novel dataset which
contains transcribed DVS, which is temporally aligned to full length HD
movies. In addition we also collected the aligned movie scripts which
have been used in prior work and compare the two different sources of
descriptions. In total the Movie Description dataset contains a parallel
corpus of over 54,000 sentences and video snippets from 72 HD movies. We
characterize the dataset by benchmarking different approaches for
generating video descriptions. Comparing DVS to scripts, we find that
DVS is far more visual and describes precisely what is shown rather than
what should happen according to the scripts created prior to movie
production.


1. Introduction 

Audio descriptions (DVS - descriptive video service) make movies
accessible to millions of blind or visually impaired people.^1 DVS
provides an audio narrative of the “most important aspects of the visual
information” [58], namely actions, gestures, scenes, and character
appearance, as can be seen in Figures 1 and 2. DVS is prepared by
trained describers and read by professional narrators. More and more
movies are audio described, but it may take up to 60 person-hours to
describe a 2-hour movie [42], so only a small subset of movies and TV
programs is available to the blind. Consequently, automating this would
be a noble task.

In addition to the benefits for the blind, generating descriptions for
video is an interesting task in itself, requiring the understanding and
combination of core techniques of computer vision and computational
linguistics. To understand the visual input one has to reliably
recognize scenes, human activities, and participating objects. To
generate a good description one has to decide what part of the visual
information to verbalize, i.e. recognize what is salient.

^1 In this work we refer for simplicity to “the blind” to account for all
blind and visually impaired people who benefit from DVS, knowing of
the variety of visual impairments and that DVS is not accessible to all.

Panel 1
  DVS:    Abby gets in the basket.
  Script: After a moment a frazzled Abby pops up in his place.

Panel 2
  DVS:    Mike leans over and sees how high they are.
  Script: Mike looks down to see - they are now fifteen feet above the
          ground.

Panel 3
  DVS:    Abby clasps her hands around his face and kisses him
          passionately.
  Script: For the first time in her life, she stops thinking and grabs
          Mike and kisses the hell out of him.

Figure 1: Audio descriptions (DVS - descriptive video service) and movie
scripts (scripts) from the movie “Ugly Truth”.

Large datasets of objects [18] and scenes [68, 70] had an important
impact in the field and significantly improved our ability to recognize
objects and scenes in combination with CNNs [38]. To be able to learn
how to generate descriptions of visual content, parallel datasets of
visual content paired with descriptions are indispensable [56]. While
several large datasets which provide images with descriptions have
recently been released [51, 29, 47], video description datasets focus on
short video snippets only and are limited in size [12] or not publicly
available [52]. TACoS Multi-Level [55] and YouCook [16] are exceptions
in providing multiple sentence descriptions and longer videos; however,
they are restricted to the cooking scenario. In contrast, the data
available with DVS provides realistic, open domain video paired with
multiple sentence descriptions. It even goes beyond this by telling a
story, which means it allows studying how to extract plots and
understand long term semantic dependencies and human interactions from
the visual and textual data.

Figures 1 and 2 show examples of DVS and compare them to movie scripts.
Scripts have been used for various tasks [43, 14, 49, 20, 46], but so
far not for video description. The main reason for this is that
automatic alignment frequently fails due to the discrepancy between the
movie and the script. Even when perfectly aligned to the movie, the
script frequently is not as precise as the DVS because it is typically
produced prior to the shooting of the movie. See, e.g., the mistakes
marked in red in Figure 2. A typical case is that part of the sentence
is correct, while another part contains irrelevant information.

“Harry Potter and the prisoner of azkaban”

Panel 1
  DVS:    Buckbeak rears and attacks Malfoy.
  Script: In a flash, Buckbeak’s steely talons slash down.
Panel 2
  DVS:    Hagrid lifts Malfoy up.
  Script: Malfoy freezes, looks down at the blood blossoming on his
          robes.
Panel 3
  DVS:    As Hagrid carries Malfoy away, the hippogriff gently nudges
          Harry.
  Script: Buckbeak whips around, raises its talons and - seeing Harry -
          lowers them.

“This is 40”

Panel 1
  DVS:    Another room, the wife and mother sits at a window with a
          towel over her hair.
  Script: Debbie opens a window and sneaks a cigarette.
Panel 2
  DVS:    She smokes a cigarette with a latex-gloved hand.
  Script: She holds her cigarette with a yellow dish washing glove.
Panel 3
  DVS:    Putting the cigarette out, she uncovers her hair, removes the
          glove and pops gum in her mouth.
  Script: She puts out the cigarette and goes through an elaborate
          routine of hiding the smell of smoke.
Panel 4
  DVS:    She pats her face and hands with a wipe, then sprays herself
          with perfume.
  Script: She puts some weird oil in her hair and uses a wet nap on her
          neck and clothes and brushes her teeth.
Panel 5
  DVS:    She pats her face and hands with a wipe, then sprays herself
          with perfume.
  Script: She sprays cologne and walks through it.

“Les Miserables”

Panel 1
  DVS:    They rush out onto the street.
  Script: Valjean and Javert hurry out across the factory yard and down
          the muddy track beyond to discover -
Panel 2
  DVS:    A man is trapped under a cart.
  Script: A heavily laden cart has toppled onto the cart driver.
Panel 3
  DVS:    Valjean is crouched down beside him.
  Script: Valjean, Javert and Javert’s assistant all hurry to help, but
          they can’t get a proper purchase in the spongy ground.
Panel 4
  DVS:    Javert watches as Valjean places his shoulder under the shaft.
  Script: He throws himself under the cart at this higher end, and
          braces himself to lift it from beneath.
Panel 5
  DVS:    Javert’s eyes narrow.
  Script: Javert stands back and looks on.

Figure 2: Audio descriptions (DVS - descriptive video service) and movie
scripts (scripts) from the movies “Harry Potter and the prisoner of
azkaban”, “This is 40”, “Les Miserables”. Typical mistakes contained in
scripts marked with red italic.

In this work we present a novel dataset which provides transcribed DVS,
which is aligned to full length HD movies. For this we retrieve audio
streams from Blu-ray HD discs, segment out the sections of the DVS audio
and transcribe them via a crowd-sourced transcription service [2]. As
the audio descriptions are not fully aligned to the activities in the
video, we manually align each sentence to the movie. Therefore, in
contrast to the (non-public) corpus used in [59, 58], our dataset
provides alignment to the actions in the video, rather than just to the
audio track of the description. In addition we also mine existing movie
scripts, pre-align them automatically, similar to [43, 14], and then
manually align the sentences to the movie.

We benchmark different approaches to generate descriptions. The first is
nearest neighbor retrieval using state-of-the-art visual features
[67, 70, 30], which does not require any additional labels but retrieves
sentences from the training data. Second, we propose to use semantic
parsing of the sentences to extract training labels for a recently
proposed translation approach [56] for video description.

The main contribution of this work is a novel movie description dataset
which provides transcribed and aligned DVS and script data sentences. We
will release sentences, alignments, video snippets, and intermediate
computed features to foster research in different areas including video
description, activity recognition, visual grounding, and understanding
of plots.

As a first study on this dataset we benchmark several approaches for
movie description. Besides sentence retrieval, we adapt the approach of
[56] by automatically extracting the semantic representation from the
sentences using semantic parsing. This approach achieves competitive
performance on the TACoS Multi-Level corpus [55] without using the
annotations and outperforms the retrieval approaches on our novel movie
description dataset. Additionally we present an approach to
semi-automatically collect and align DVS data and analyse the
differences between DVS and movie scripts.
2. Related Work

We first discuss recent approaches to video description and then the
existing works using movie scripts and DVS.

In recent years there has been an increased interest in automatically
describing images [23, 39, 40, 50, 45, 40, 41, 34, 61, 22] and videos
[37, 27, 8, 28, 32, 62, 16, 26, 64, 55] with natural language. While
recent works on image description show impressive results by learning
the relations between images and sentences and generating novel
sentences [41, 19, 48, 56, 35, 31, 65, 13], the video description works
typically rely on retrieval or templates [16, 63, 26, 27, 37, 39, 62]
and frequently use a separate language corpus to model the linguistic
statistics. A few exceptions exist: [64] uses a pre-trained model for
image description and adapts it to video description. [56, 19] learn a
translation model; however, the approaches rely on a strongly annotated
corpus with aligned videos, annotations, and sentences. The main reason
for video description lagging behind image description seems to be a
missing corpus to learn and understand the problem of video description.
We try to address this limitation by collecting a large, aligned corpus
of video snippets and descriptions. To handle the setting of having only
videos and sentences without annotations for each video snippet, we
propose an approach which adapts [56] by extracting annotations from the
sentences. Our extraction of annotations has similarities to [63], but
we try to extract the senses of the words automatically by using
semantic parsing as discussed in Section 5.

Movie scripts have been used for automatic discovery and annotation of
scenes and human actions in videos [43, 49, 20]. We rely on the approach
presented in [43] to align movie scripts using the subtitles. [10]
attacks the problem of learning a joint model of actors and actions in
movies using weak supervision provided by scripts. They also rely on a
semantic parser (SEMAFOR [15]) trained on the FrameNet database [7];
however, they limit the recognition to only two frames. [11] aims to
localize individual short actions in longer clips by exploiting the
ordering constraints as weak supervision.

DVS has so far mainly been studied from a linguistic perspective. [58]
analyses the language properties on a non-public corpus of DVS from 91
films. Their corpus is based on the original sources used to create the
DVS and contains different kinds of artifacts not present in the actual
description, such as dialogs and production notes. In contrast, our text
corpus is much cleaner as it consists only of the actual DVS. With
respect to word frequency they identify that especially actions,
objects, and scenes, as well as the characters, are mentioned. The
analysis of our corpus reveals similar statistics to theirs.

The only work we are aware of which uses DVS in connection with computer
vision is [59]. The authors try to understand which characters interact
with each other. For this they first segment the video into events by
detecting dialogue, exciting, and musical events using audio and visual
features. Then they rely on the dialogue transcription and DVS to
identify when characters occur together in the same event, which allows
them to infer interaction patterns. In contrast to our dataset, their
DVS is not aligned, and they try to resolve this with a heuristic that
moves the event, which is not quantitatively evaluated. Our dataset will
allow studying the quality of automatic alignment approaches, given
annotated ground truth alignment.

There are some initial works to support DVS production using scripts as
source [42] and automatically finding scene boundaries [25]. However, we
believe that our dataset will allow learning much more advanced
multi-modal models, using recent techniques in visual recognition and
natural language processing.

Semantic parsing has received much attention in computational
linguistics recently; see, for example, the tutorial [6] and references
given there. Although aiming at general-purpose applicability, it has so
far been successful mainly for specific use-cases such as
natural-language question answering [9, 21] or understanding temporal
expressions [44].

3. The Movie Description dataset 

Despite the potential benefit of DVS for computer vision, it has not
been used so far apart from [25, 42], who study how to automate DVS
production. We believe the main reason for this is that it is not
available in text format, i.e. transcribed. We tried to get access to
DVS transcripts from description services as well as movie and TV
production companies, but they were not ready to provide or sell them.
While script data is easier to obtain, large parts of it do not match
the movie, and they have to be “cleaned up”. In the following we
describe our semi-automatic approach to obtain DVS and scripts and align
them to the video.

3.1. Collection of DVS 

We search for Blu-ray movies with DVS in the “Audio Description” section
of the British Amazon [1] and select a set of 46 movies of diverse
genres.^2 As DVS is only available in audio format, we first retrieve
the audio stream from the Blu-ray HD disc.^3 Then we semi-automatically
segment out the sections of the DVS audio (which is mixed with the
original audio stream) with the approach described below. The audio
segments are then transcribed by a crowd-sourced transcription service
[2] that also provides us with the time-stamps for each spoken sentence.
As the DVS is added to the original audio stream between the dialogs,
there might be a small misalignment between the time of speech and the
corresponding visual content. Therefore, we manually align each sentence
to the movie in-house.

^2 2012, Bad Santa, Body Of Lies, Confessions Of A Shopaholic, Crazy
Stupid Love, 27 Dresses, Flight, Gran Torino, Harry Potter and the deathly
hallows Disk One, Harry Potter and the Half-Blood Prince, Harry Potter
and the order of phoenix, Harry Potter and the philosophers stone, Harry
Potter and the prisoner of azkaban, Horrible Bosses, How to Lose Friends
and Alienate People, Identity Thief, Juno, Legion, Les Miserables,
Marley and me, No Reservations, Pride And Prejudice Disk One, Pride And
Prejudice Disk Two, Public Enemies, Quantum of Solace, Rambo, Seven
pounds, Sherlock Holmes A Game of Shadows, Signs, Slumdog Millionaire,
Spider-Man1, Spider-Man3, Super 8, The Adjustment Bureau, The
Curious Case Of Benjamin Button, The Damned united, The devil wears
prada, The Great Gatsby, The Help, The Queen, The Ugly Truth, This is
40, TITANIC, Unbreakable, Up In The Air, Yes man.

                        Before alignment   After alignment
              Movies    Words              Words      Sentences   Avg. length   Total length
DVS           46        284,401            276,676    30,680      4.1 sec.      34.7 h.
Movie script  31        262,155            238,889    23,396      3.4 sec.      21.7 h.
Total         72        546,556            515,565    54,076      3.8 sec.      56.5 h.

Table 1: Movie Description dataset statistics. For discussion see Section 3.3.

Semi-Automatic segmentation of DVS. We first estimate the temporal
alignment difference between the DVS and the original audio (which is
part of the DVS), as they might be off by a few time frames. The precise
alignment is important to compute the similarity of both streams. Both
steps (alignment and similarity) are computed using the spectrograms of
the audio streams, which are computed using the Fast Fourier Transform
(FFT). If the difference between both audio streams is larger than a
given threshold we assume the DVS contains audio description at that
point in time. We smooth this decision over time using a minimum segment
length of 1 second. The threshold was picked on a few sample movies, but
has to be adjusted for each movie due to different mixing of the audio
description stream, different narrator voice level, and movie sound.
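
The thresholding step can be illustrated with a minimal sketch. It assumes the two tracks are already time-aligned mono signals held in NumPy arrays; the function names, frame length, hop size, and the mean absolute spectrogram difference used as the dissimilarity measure are all illustrative assumptions, not the settings used for the dataset.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=1024, hop=512):
    """Magnitude spectrogram via a windowed FFT (frames x frequency bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def detect_description_segments(dvs, original, sample_rate, threshold,
                                min_len_sec=1.0, frame_len=1024, hop=512):
    """Return (start_sec, end_sec) intervals where the DVS track differs from
    the original track by more than `threshold` (hypothetical value that
    would have to be tuned per movie)."""
    spec_dvs = magnitude_spectrogram(dvs, frame_len, hop)
    spec_orig = magnitude_spectrogram(original, frame_len, hop)
    n = min(len(spec_dvs), len(spec_orig))
    # Per-frame dissimilarity between the two streams (assumed measure).
    diff = np.mean(np.abs(spec_dvs[:n] - spec_orig[:n]), axis=1)
    active = diff > threshold

    # Keep only runs of active frames that last at least `min_len_sec`.
    min_frames = int(np.ceil(min_len_sec * sample_rate / hop))
    segments, start = [], None
    for i, flag in enumerate(np.append(active, False)):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_frames:
                segments.append((start * hop / sample_rate,
                                 i * hop / sample_rate))
            start = None
    return segments
```

In this sketch the minimum segment length only discards short detections; merging segments separated by brief pauses would be a natural extension but is not shown.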

3.2. Collection of script data 

In addition we mine the script web resources^4 and select 26 movie
scripts.^5 As a starting point we use the movies featured in [49] that
have the highest alignment scores. We are also interested in comparing
the two sources (movie scripts and DVS), so we look for scripts labeled
as “Final”, “Shooting”, or “Production Draft” where DVS is also
available. We found that the “overlap” is quite narrow, so we analyze 5
such movies^6 in our dataset. This way we end up with 31 movie scripts
in total. We follow existing approaches [43, 14] to automatically align
scripts to movies. First we parse the scripts, extending the method of
[43] to handle scripts which deviate from the default format. Second, we
extract the subtitles from the Blu-ray discs.^7 Then we use the dynamic
programming method of [43] to align scripts to subtitles and infer the
time-stamps for the description sentences. We select the sentences with
a reliable alignment score (the ratio of matched words in the nearby
monologues) of at least 0.5. The obtained sentences are then manually
aligned to the video in-house.

^3 We use [3] to extract a Blu-ray disc into an .mkv file, then [5] to
select and extract the audio streams from it.

^4 http://www.weeklyscript.com, http://www.simplyscripts.com,
http://www.dailyscript.com, http://www.imsdb.com

^5 Amadeus, American Beauty, As Good As It Gets, Casablanca,
Charade, Chinatown, Clerks, Double Indemnity, Fargo, Forrest Gump,
Gandhi, Get Shorty, Halloween, It is a Wonderful Life, O Brother Where
Art Thou, Pianist, Raising Arizona, Rear Window, The Crying Game, The
Graduate, The Hustler, The Lord Of The Rings The Fellowship Of The
Ring, The Lord Of The Rings The Return Of The King, The Lost Weekend,
The Night of the Hunter, The Princess Bride.
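
The selection of script sentences by alignment score in Section 3.2 can be sketched as follows. This is only an illustration under stated assumptions: the score is approximated as the fraction of words of the neighbouring script dialogue that also occur in the matched subtitle text, and the dictionary keys and function names are hypothetical. The exact score computed by the method of [43] may differ.

```python
def matched_word_ratio(script_dialogue, subtitle_text):
    """Fraction of script dialogue words that also occur in the subtitles."""
    script_words = script_dialogue.lower().split()
    subtitle_words = set(subtitle_text.lower().split())
    if not script_words:
        return 0.0
    matched = sum(1 for word in script_words if word in subtitle_words)
    return matched / len(script_words)

def select_reliable(descriptions, min_score=0.5):
    """Keep description sentences whose neighbouring dialogue aligned well.

    `descriptions` is a list of dicts with the hypothetical keys
    'sentence', 'script_dialogue' and 'subtitle_text'.
    """
    return [d for d in descriptions
            if matched_word_ratio(d["script_dialogue"],
                                  d["subtitle_text"]) >= min_score]
```

Sentences that pass this filter would still go through the manual alignment step described above.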

3.3. Statistics and comparison to other datasets 

During the manual alignment we filter out: a) sentences describing the
movie introduction/ending (production logo, cast etc.); b) texts read
from the screen; c) irrelevant sentences describing something not
present in the video; d) sentences related to audio/sounds/music. Table
1 presents statistics on the number of words before and after the
alignment to video. One can see that for the movie scripts the reduction
in the number of words is about 8.9%, while for DVS it is 2.7%. In the
case of DVS the filtering mainly happens due to initial/ending movie
intervals and transcribed dialogs (when shown as text). For the scripts
it is mainly attributed to irrelevant sentences. Note that in cases
where the sentences are “alignable” but have minor mistakes we still
keep them.
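
The quoted reductions follow directly from the word counts in Table 1. The short calculation below reproduces them; the variable and function names are illustrative only.

```python
# Word counts (before alignment, after alignment) taken from Table 1.
word_counts = {
    "DVS": (284401, 276676),
    "Movie script": (262155, 238889),
}

def reduction_percent(before, after):
    """Relative reduction in the number of words, in percent."""
    return 100.0 * (before - after) / before

reductions = {source: round(reduction_percent(before, after), 1)
              for source, (before, after) in word_counts.items()}
print(reductions)  # {'DVS': 2.7, 'Movie script': 8.9}
```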

We end up with a parallel corpus of over 50K video-sentence pairs and a
total length of over 56 hours. We compare our corpus to other existing
parallel corpora in Table 2. The main limitations of existing datasets
are a single domain [16, 54, 55] or a limited number of video clips
[26]. We fill the gap with a large dataset featuring realistic open
domain videos, which also provides high quality (professional) sentences
and allows for multi-sentence description.

3.4. Visual features 

We extract video snippets from the full movie based on the aligned
sentence intervals. We also uniformly extract 10 frames from each video
snippet. As discussed above, DVS and scripts describe activities,
objects, and scenes (as well as emotions, which we do not explicitly
handle with these features, but they might still be captured, e.g. by
the context or activities). In the following we briefly introduce the
visual features computed on our data, which we will also make publicly
available.

^6 Harry Potter and the prisoner of azkaban, Les Miserables, Signs, The
Ugly Truth, This is 40.

^7 We extract .srt from .mkv with [4]. It also allows for subtitle
alignment and spellchecking.

Dataset                   multi-sentence  domain   sentence source  clips   videos  sentences
YouCook [26]              x               cooking  crowd            -       88      2,668
TACoS [54, 56]            x               cooking  crowd            7,206   127     18,227
TACoS Multi-Level [55]    x               cooking  crowd            14,105  273     52,593
MSVD [12]                 -               open     crowd            1,970   -       70,028
Movie Description (ours)  x               open     professional     54,076  72      54,076

Table 2: Comparison of video description datasets. For discussion see Section 3.3.

DT We extract the improved dense trajectories compensated for camera
motion [67]. For each feature (Trajectory, HOG, HOF, MBH) we create a
codebook with 4000 clusters and compute the corresponding histograms. We
apply L1 normalization to the obtained histograms and use them as
features.

LSDA We use the recent large scale object detection CNN [30] which
distinguishes 7604 ImageNet [18] classes. We run the detector on every
second extracted frame (due to computational constraints). Within each
frame we max-pool the network responses for all classes, then do
mean-pooling over the frames within a video snippet and use the result
as a feature.
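
A minimal sketch of this pooling scheme is given below. The input layout (one score matrix per extracted frame, with one row per detection and one column per class) and the function name are assumptions made for illustration; the detector itself is not shown.

```python
import numpy as np

def lsda_snippet_feature(frame_detections, step=2):
    """Pool detector scores into one feature vector per video snippet.

    `frame_detections` is a list with one array per extracted frame, each of
    shape (num_detections, num_classes). Only every `step`-th frame is used.
    """
    used_frames = frame_detections[::step]
    # Max-pool over the detections within each frame, per class.
    per_frame = [np.asarray(scores).max(axis=0) for scores in used_frames]
    # Mean-pool the per-frame vectors over the snippet.
    return np.mean(per_frame, axis=0)
```

The scene features described next use the same mean-pooling over frames, without the per-frame max-pooling step.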

PLACES and HYBRID Finally, we use the recent scene classification CNNs
[70] featuring 205 scene classes. We use both available networks:
Places-CNN and Hybrid-CNN, where the first is trained on the Places
dataset [70] only, while the second is additionally trained on the 1.2
million images of ImageNet (ILSVRC 2012) [57]. We run the classifiers on
all the extracted frames of our dataset. We mean-pool over the frames of
each video snippet, using the result as a feature.

4. Approaches to video description 

In this section we describe the approaches to video de¬ 
scription that we benchmark on our proposed dataset. 

Nearest neighbor We retrieve the closest sentence from the training
corpus using the L1-normalized visual features introduced in Section 3.4
and the intersection distance.
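
The retrieval baseline can be sketched as follows. The sketch assumes that the intersection distance is one minus the histogram intersection of the L1-normalized feature vectors; the function names are hypothetical.

```python
import numpy as np

def l1_normalize(x):
    """Scale each feature vector so that its absolute values sum to one."""
    x = np.asarray(x, dtype=float)
    total = np.abs(x).sum(axis=-1, keepdims=True)
    return x / np.where(total > 0, total, 1.0)

def retrieve_sentence(query_feature, train_features, train_sentences):
    """Return the training sentence whose feature is closest to the query."""
    query = l1_normalize(query_feature)
    train = l1_normalize(train_features)
    # Histogram intersection: sum of element-wise minima.
    intersection = np.minimum(train, query).sum(axis=1)
    distance = 1.0 - intersection
    return train_sentences[int(np.argmin(distance))]
```

For example, with training features `[[4, 0, 0], [1, 1, 2]]` and a query `[1, 1, 1]`, the second training sentence is retrieved because its normalized histogram overlaps more with the query.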

SMT We adapt the two-step translation approach of [56], which uses an
intermediate semantic representation (SR), modeled as a tuple, e.g.
(cut, knife, tomato). As the first step it learns a mapping from the
visual input to the SR, modeling pairwise dependencies in a CRF using
visual classifiers as unaries. The unaries are trained using an SVM on
dense trajectories [66]. In the second step [56] translates the SR to a
sentence using Statistical Machine Translation (SMT) [36]. For this the
approach uses the concatenated SR as input language, e.g. cut knife
tomato, and the paired natural sentence as output language, e.g. The
person slices the tomato. Since we cannot rely on an annotated SR as in
[56], we automatically mine the SR from sentences using semantic
parsing, which we introduce in the next section. In addition to dense
trajectories we use the features described in Section 3.4.

SMT Visual words As an alternative to potentially noisy labels extracted
from the sentences, we try to directly translate visual classifiers and
visual words to a sentence. We model the essential components by relying
on activity, object, and scene recognition. For objects and scenes we
rely on the pre-trained models LSDA and PLACES. For activities we rely
on the state-of-the-art activity recognition feature DT. We cluster the
DT histograms to 300 visual words using k-means. The index of the
closest cluster center from our activity category is chosen as label. To
build our tuple we obtain the highest scoring class labels of the object
detector and scene classifier. More specifically, for the object
detector we consider the two highest scoring classes: for subject and
object. Thus we obtain the tuple
$\langle \text{SUBJECT}, \text{ACTIVITY}, \text{OBJECT}, \text{SCENE} \rangle = \langle \arg\max(\text{LSDA}), \text{DT}_i, \arg\max_2(\text{LSDA}), \arg\max(\text{PLACES}) \rangle$,
for which we learn translation to a natural sentence using the SMT
approach discussed above.
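
Building such a tuple can be sketched as below. The sketch assumes that the closest cluster center is found with the Euclidean distance and that class names are available as plain lists; the function and argument names are hypothetical.

```python
import numpy as np

def visual_word_tuple(lsda_scores, dt_histogram, dt_centers, places_scores,
                      object_names, scene_names):
    """Build a (SUBJECT, ACTIVITY, OBJECT, SCENE) tuple for one snippet."""
    # Two highest scoring object classes: first as subject, second as object.
    top_two = np.argsort(lsda_scores)[::-1][:2]
    # Activity label: index of the closest k-means center (Euclidean, assumed).
    distances = np.linalg.norm(np.asarray(dt_centers) - np.asarray(dt_histogram),
                               axis=1)
    activity = f"DT{int(np.argmin(distances))}"
    # Highest scoring scene class.
    scene = scene_names[int(np.argmax(places_scores))]
    return (object_names[top_two[0]], activity,
            object_names[top_two[1]], scene)
```

The resulting tuples would then be written out as the source side of the parallel text on which the SMT model is trained.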

5. Semantic parsing 

Learning from a parallel corpus of videos and sentences without having
annotations is challenging. In this section we introduce our approach to
exploit the sentences using semantic parsing. The proposed method aims
to extract annotations from the natural sentences and makes it possible
to avoid the tedious annotation task. Later in the section we evaluate
our method on a corpus where annotations are available, in the context
of a video description task.

5.1. Semantic parsing approach 

We lift the words in a sentence to a semantic space of roles and WordNet
[53, 24] senses by performing SRL (Semantic Role Labeling) and WSD (Word
Sense Disambiguation). For an example, refer to Table 3: the expected
outcome of semantic parsing on the input sentence “He shot a video in
the moving bus” is “Agent: man, Action: shoot, Patient: video, Location:
bus”. Additionally, the role fillers are disambiguated.

Phrase          WordNet Mapping  VerbNet Mapping    Expected Frame
the man         man#1            Agent.animate      Agent: man#1
begin to shoot  shoot#2          shoot#vn#2         Action: shoot#2
a video         video#1          Patient.solid      Patient: video#1
in              in               PP.in
the moving bus  bus#1            NP.Location.solid  Location: moving bus#1

Table 3: Semantic parse for “He began to shoot a video in the moving
bus”. For discussion see Section 5.1.

We use the ClausIE tool [17] to decompose sentences into their
respective clauses. For example, “he shot and modified the video” is
split into two phrases, “he shot the video” and “he modified the video”.
We then use the OpenNLP tool suite^8 for chunking the text of each
clause. In order to link the words in the sentence to their WordNet
sense mappings, we rely on a state-of-the-art WSD system, IMS [69]. The
WSD system, however, works at the word level. We enable it to work at
the phrase level. For every noun phrase, we identify and disambiguate
its head word (e.g. the moving bus to “bus#1”, where “bus#1” refers to
the first sense of the word bus). We link verb phrases to the proper
sense of their head word in WordNet (e.g. begin to shoot to “shoot#2”).

In order to obtain word role labels, we link verbs to VerbNet [60, 33],
a manually curated high-quality linguistic resource for English verbs.
VerbNet is already mapped to WordNet, thus we map to VerbNet via
WordNet. We perform two levels of matches in order to obtain role
labels. First is the syntactic match. Every VerbNet verb sense comes
with a syntactic frame, e.g. for shoot, the syntactic frame is NP V NP.
We first match the sentence’s verb against the VerbNet frames. These
become candidates for the next step. Second we perform the semantic
match: VerbNet also provides a role restriction on the arguments of the
roles, e.g. for shoot (sense killing), the role restriction is
Agent.animate V Patient.animate PP Instrument.solid. For the other sense
of shoot (sense snap), the semantic restriction is Agent.animate V
Patient.solid. We only accept candidates from the syntactic match that
satisfy the semantic restriction.
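
The two-level match can be illustrated with a toy example. The frame table below is a hand-written stand-in, not the VerbNet data format or API, and both senses are simplified to the NP V NP frame; the chunk types and semantic types of the arguments are assumed to be given by earlier processing steps.

```python
# Toy stand-in for VerbNet entries: each slot is (chunk type, role, restriction).
TOY_FRAMES = {
    "shoot": [
        {"sense": "kill",
         "slots": [("NP", "Agent", "animate"), ("V", None, None),
                   ("NP", "Patient", "animate")]},
        {"sense": "snap",
         "slots": [("NP", "Agent", "animate"), ("V", None, None),
                   ("NP", "Patient", "solid")]},
    ]
}

def label_roles(verb, chunks, frames=TOY_FRAMES):
    """Assign roles to chunks given as (chunk_type, text, semantic_type)."""
    chunk_types = [chunk[0] for chunk in chunks]
    # Step 1: syntactic match, the frame must have the same chunk sequence.
    candidates = [frame for frame in frames.get(verb, [])
                  if [slot[0] for slot in frame["slots"]] == chunk_types]
    # Step 2: semantic match, every role restriction must be satisfied.
    for frame in candidates:
        if all(restriction is None or restriction == semantic_type
               for (_, _, restriction), (_, _, semantic_type)
               in zip(frame["slots"], chunks)):
            roles = {role: text
                     for (_, role, _), (_, text, _)
                     in zip(frame["slots"], chunks) if role}
            return frame["sense"], roles
    return None
```

With the chunks for “the man shoots a video”, where “a video” is typed as solid, only the snap sense survives the semantic match.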

^8 http://opennlp.sourceforge.net/


Input: Someone puts the tools back in the shed.

Output:

text      role      sense      WordNet synset
someone   SUBJECT   100007846  {person, individual, someone, ...}
put-back  VERB      201308381  {replace, put back}
the-tool  OBJECT    104451818  {tool}
the-shed  LOCATION  104187547  {shed}

(a) Semantic representation extracted from a sentence.

• The van pulls into the forecourt.
  - sense of {pull}: move into a certain direction
• Someone pulls the purse imperceptibly closer to himself.
  - sense of {pull, draw, force}: cause to move by pulling
• People play a fast and furious game.
  - sense of {play}: participate in games or sport
• At one end of the room an orchestra is playing.
  - sense of {play}: play on an instrument

(b) Same verb, different senses.

• Someone leaps onto a bench by a couple hugging.
• Someone drops to his knee to embrace his son.
  - sense of {hug, embrace}: squeeze (someone) tightly in your arms,
    usually with fondness
• Someone spins and grabs his car-door handle.
• And someone takes hold of her hand.
  - sense of {grab, take hold of}: take hold of so as to seize or
    restrain or stop the motion of

(c) Different verbs, same sense.

Figure 3: Semantic parsing example, see Section 5.1.


VerbNet contains over 20 roles and not all of them are general or can be
recognized reliably. Therefore, we further group them to get the
SUBJECT, VERB, OBJECT and LOCATION roles. We explore two approaches to
obtaining the labels based on the output of the semantic parser. The
first is to use the extracted text chunks directly as labels. The second
is to use the corresponding senses as labels (and therefore group
multiple text labels). In the following we refer to these as text- and
sense-labels. Thus from each sentence we extract a semantic
representation in the form of (SUBJECT, VERB, OBJECT, LOCATION); see
Figure 3a for an example. Using the WSD allows us to identify different
senses (WordNet synsets) for the same verb (Figure 3b) and the same
sense for different verbs (Figure 3c).
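
The difference between text-labels and sense-labels can be sketched with the verbs of Figure 3c. The sense identifiers below are placeholders rather than real WordNet synset ids, and the function name is hypothetical.

```python
from collections import defaultdict

def group_by_sense(parsed_verbs):
    """Group text chunks that share a sense.

    `parsed_verbs` is a list of (text_chunk, sense_id) pairs as they might
    come out of the parser.
    """
    groups = defaultdict(set)
    for text, sense in parsed_verbs:
        groups[sense].add(text)
    return dict(groups)

# Placeholder sense ids standing in for WordNet synsets.
parsed = [("hug", "sense_A"), ("embrace", "sense_A"),
          ("grab", "sense_B"), ("take-hold", "sense_B")]

text_labels = {text for text, _ in parsed}   # four distinct text-labels
sense_labels = group_by_sense(parsed)        # two sense-labels
```

Using senses as labels merges “hug” with “embrace” and “grab” with “take-hold”, so each label collects more training examples.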

5.2. Applying parsing to TACoS Multi-Level corpus 

We apply the proposed semantic parsing to the TACoS
Multi-Level [55] parallel corpus. We extract the SR from
the sentences as described above and use those as annotations.
Note that this corpus is annotated with the tuples
(ACTIVITY, OBJECT, TOOL, SOURCE, TARGET)
and the subject is always the person. Therefore we drop
the SUBJECT role and only use (VERB, OBJECT, LOCATION)
as our SR. Then, similar to [55], we train visual
classifiers for our labels (proposed by the parser), using only
the ones that appear at least 30 times. Next we train a
CRF with 3 nodes for verbs, objects and locations, using the
visual classifier responses as unaries. We follow the translation
approach of [56] and train the SMT on the Detailed
Descriptions part of the corpus using our labels. Finally,
we translate the SR predicted by our CRF to generate the
sentences. Table 4 shows the results comparing our method
to [56] and [55], who use manual annotations to train their
models. As we can see, the sense-labels perform better than
the text-labels as they provide better grouping of the labels.
Our method produces a competitive result which is only 0.9%
below the result of [56] (24.0 vs. 24.9 BLEU). At the same
time, [55] uses more training data, additional color SIFT
features and recognizes the dish prepared in the video. All
these points, if added to our approach, would also improve
the performance.


Approach                     BLEU

SMT [56]                     24.9
SMT [55]                     26.9
SMT with our text-labels     22.3
SMT with our sense-labels    24.0

Table 4: BLEU@4 in % on sentences of Detailed Descriptions
of the TACoS Multi-Level [55] corpus, see Section 5.2.


Annotations         activity   tool   object   source   target

Manual [55]               78     53      138       69       49

Annotations             verb   object   location

Our text-labels          145      260         85
Our sense-labels         158      215         85

Table 5: Label statistics from our semantic parser on TACoS
Multi-Level [55] corpus, see Section 5.2.
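
The label selection step described in Section 5.2 (keeping only the parser labels that occur at least 30 times) can be illustrated with a minimal sketch. The function name `frequent_labels`, the dictionary representation of a parsed sentence, and the toy data are assumptions made for illustration; they are not taken from the original implementation.

```python
from collections import Counter


def frequent_labels(parsed_sentences, min_count=30):
    """Return, per semantic role, the labels seen at least `min_count` times.

    `parsed_sentences` is assumed to be a list of dicts mapping a role name
    (e.g. "VERB", "OBJECT", "LOCATION") to the label the parser extracted.
    This data layout is hypothetical.
    """
    counts = {}
    for roles in parsed_sentences:
        for role, label in roles.items():
            counts.setdefault(role, Counter())[label] += 1
    # Keep only labels that are frequent enough to train a classifier on.
    return {
        role: {label for label, n in counter.items() if n >= min_count}
        for role, counter in counts.items()
    }


# Toy usage with a lowered threshold so the effect is visible.
toy = [
    {"VERB": "cut", "OBJECT": "carrot"},
    {"VERB": "cut", "OBJECT": "knife"},
    {"VERB": "wash", "OBJECT": "carrot"},
]
kept = frequent_labels(toy, min_count=2)
```

With the threshold set to 30 (or 100, as used later for the movie corpus), rare labels are discarded before any visual classifier is trained, which reduces the number of classes the CRF has to predict.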

We analyze the labels selected by our method in Table
5. It is clear that our labels are still imperfect, i.e. different
labels might be assigned to similar concepts. However, the
number of extracted labels is quite close to the number of
manual labels. Note that the annotations were created prior
to the sentence collection, so some verbs used by humans in
sentences might not be present in the annotations.

From this experiment we conclude that the output of our
automatic parsing approach can serve as a replacement for
manual annotations and achieves competitive results.
In the following we apply this approach to our movie
description dataset.


                 Correctness   Relevance

DVS                     63.0        60.7
Movie scripts           37.0        39.3

Table 6: Human evaluation of DVS and movie scripts:
which sentence is more correct/relevant with respect to the
video, in %. Discussion in Section 6.1.


Corpus                      Clause    NLP    Labels    WSD

TACoS Multi-Level [55]        0.96   0.86      0.91   0.75
Movie Description (ours)      0.89   0.62      0.86   0.7

Table 7: Semantic parser accuracy for TACoS Multi-Level
and our new corpus. Discussion in Section 6.2.


6. Evaluation 

In this section we provide more insights about our movie
description dataset. First we compare DVS to movie scripts
and then we benchmark the approaches to video description
introduced in Section 4.

6.1. Comparison DVS vs script data 

We compare the DVS and script data using 5 movies
from our dataset where both are available (see Section 3.2).
For these movies we select the overlapping time intervals
with an intersection over union overlap of at least 75%,
which results in 126 sentence pairs. We ask humans via
Amazon Mechanical Turk (AMT) to compare the sentences
with respect to their correctness and relevance to the video,
using both video intervals as a reference (one at a time,
resulting in 252 tasks). Each task was completed by 3
different human subjects. Table 6 presents the results of this
evaluation. DVS is ranked as more correct and relevant in
over 60% of the cases, which supports our intuition that
scripts contain mistakes and irrelevant content even after
being cleaned up and manually aligned.
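
The pair selection criterion above (intersection over union of the two time intervals of at least 75%) can be sketched as follows. This is an illustrative sketch only: the function names, the `(start, end)` interval representation in seconds, and the exhaustive pairwise matching are assumptions, not the procedure actually used to build the 126 pairs.

```python
def temporal_iou(a, b):
    """Intersection over union of two time intervals given as (start, end)."""
    start_a, end_a = a
    start_b, end_b = b
    intersection = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - intersection
    return intersection / union if union > 0 else 0.0


def match_pairs(dvs, script, threshold=0.75):
    """Pair DVS and script sentences whose intervals overlap enough.

    `dvs` and `script` are assumed to be lists of ((start, end), sentence).
    """
    pairs = []
    for dvs_interval, dvs_sentence in dvs:
        for script_interval, script_sentence in script:
            if temporal_iou(dvs_interval, script_interval) >= threshold:
                pairs.append((dvs_sentence, script_sentence))
    return pairs


# Toy usage: only the first script sentence overlaps enough with the DVS one.
pairs = match_pairs(
    [((0, 4), "dvs sentence")],
    [((1, 4), "script sentence 1"), ((3, 8), "script sentence 2")],
)
```

In the toy example, the intervals (0, 4) and (1, 4) share 3 seconds out of a 4-second union, giving an IoU of exactly 0.75, which meets the threshold.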

6.2. Semantic parser evaluation 

Table 7 reports the accuracy of the different components
of the semantic parsing pipeline. The components are
clause splitting (Clause), POS tagging and chunking (NLP),
semantic role labeling (Labels) and word sense disambiguation
(WSD). We manually evaluate the correctness on a randomly
sampled set of sentences using human judges. It is
evident that the poorest performing parts are the NLP and
the WSD components. Some of the NLP mistakes arise due
to incorrect POS tagging. WSD is considered a hard problem
and when the dataset contains less frequent words, the
performance is severely affected. Overall we see that the
movie description corpus is more challenging than TACoS
Multi-Level, but the drop in performance is reasonable
compared to the significantly larger variability.


                              Correctness   Grammar   Relevance

Nearest neighbor
  DT                                  7.6       5.1         7.5
  LSDA                                7.2       4.9         7.0
  PLACES                              7.0       5.0         7.1
  HYBRID                              6.8       4.6         7.1

SMT Visual words                      7.6       8.1         7.5

SMT with our text-labels
  DT 30                               6.9       8.1         6.7
  DT 100                              5.8       6.8         5.5
  All 100                             4.6       5.0         4.9

SMT with our sense-labels
  DT 30                               6.3       6.3         5.8
  DT 100                              4.9       5.7         5.1
  All 100                             5.5       5.7         5.5

Movie script/DVS                      2.9       4.2         3.2

Table 8: Comparison of approaches. Mean Ranking (1-12).
Lower is better. Discussion in Section 6.3.

6.3. Video description 

As the collected text data comes from the movie context,
it contains a lot of information specific to the plot, such as
names of the characters. We pre-process each sentence in
the corpus, transforming the names and other person-related
information (such as “a young woman”) to “someone” or
“people”. The transformed version of the corpus is used in
all the experiments below. We will release both the
transformed and the original corpus.
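
The paper does not specify how this transformation is implemented. Purely as an illustration of the idea, the sketch below replaces a given list of character names and person phrases with “someone” using a regular expression. The function name `anonymize`, the hand-supplied lists, and the capitalization rule are all assumptions; the plural case (“people”) is deliberately not handled here.

```python
import re


def anonymize(sentence, names, person_phrases=()):
    """Replace character names and person phrases with "someone".

    `names` and `person_phrases` are assumed to be supplied by hand or by
    some upstream tool; this sketch does not detect them automatically.
    """
    # Match longer targets first so "a young woman" wins over "woman".
    targets = sorted(list(names) + list(person_phrases), key=len, reverse=True)
    if not targets:
        return sentence
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in targets) + r")\b",
        re.IGNORECASE,
    )
    replaced = pattern.sub("someone", sentence)
    # Restore sentence-initial capitalization after substitution.
    return replaced[0].upper() + replaced[1:] if replaced else replaced


example = anonymize(
    "Harry follows a young woman into the pub.",
    names=["Harry"],
    person_phrases=["a young woman"],
)
```

A real pipeline would likely need named-entity recognition or the cast list of each movie to obtain the names; the sketch only shows the substitution step.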

We use the 5 movies mentioned before (see Section 3.2)
as a test set for the video description task, while all the
others (67) are used for training. Human judges were asked to
rank multiple sentence outputs with respect to their
correctness, grammar and relevance to the video.

Table 8 summarizes results of the human evaluation from
250 randomly selected test video snippets, showing the
mean rank, where lower is better. In the top part of the
table we show the nearest neighbor results based on multiple
visual features. When comparing the different features, we
notice that the pre-trained features (LSDA, PLACES,
HYBRID) perform better than DT, with HYBRID performing
best. Next is the translation approach with the visual
words as labels, performing overall worst of all approaches.
The next two blocks correspond to the translation approach
when using the labels from our semantic parser. After
extracting the labels we select the ones which appear at least
30 or 100 times as our visual attributes. As 30 results in a
much higher number of attributes (see Table 9), predicting
the SR turns into a more difficult recognition task,
resulting in worse mean rankings. “All 100” refers to combining
all the visual features as unaries in the CRF. Finally, the last
“Movie script/DVS” block refers to the actual test sentences
from the corpus and, not surprisingly, ranks best.


Annotations         subject   verb   object   location

text-labels 30           24    380      137         71
sense-labels 30          47    440      244        110
text-labels 100           8    121       26          8
sense-labels 100          8    143       51         37

Table 9: Label statistics from our semantic parser on the
movie description corpus. 30 and 100 indicate the minimum
number of label occurrences in the corpus, see Section 6.3.
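
The mean rank reported in Table 8 can be illustrated with a small sketch: each judgment assigns a rank to every method's sentence for one snippet, and the ranks are averaged per method. The function name `mean_ranks`, the dictionary layout of a judgment, and the toy values are assumptions for illustration; how ties and multiple judges per snippet were handled in the actual evaluation is not stated in the text.

```python
from collections import defaultdict


def mean_ranks(judgments):
    """Average the rank each method received over all judgments.

    `judgments` is assumed to be a list of dicts mapping a method name to
    the rank (1 = best) it was given for one video snippet.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for judgment in judgments:
        for method, rank in judgment.items():
            totals[method] += rank
            counts[method] += 1
    return {method: totals[method] / counts[method] for method in totals}


# Toy usage with three methods and two judged snippets.
toy_judgments = [
    {"Movie script/DVS": 1, "SMT text-labels": 2, "Nearest neighbor": 3},
    {"Movie script/DVS": 1, "SMT text-labels": 3, "Nearest neighbor": 2},
]
ranks = mean_ranks(toy_judgments)
```

Because ranks are averaged, a lower value means the method's sentence was preferred more often, which matches the “lower is better” reading of Table 8.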

Overall we can observe three main tendencies: (1) Using
our parsing with SMT outperforms nearest neighbor
baselines and SMT Visual words. (2) In contrast to the kitchen
dataset, the sense labels perform slightly worse than the text
labels, which we attribute to the errors made in the WSD.
(3) The actual movie script/DVS sentences are ranked on
average significantly better than any of the automatic approaches.
These tendencies are also reflected in Figure 4, showing
example outputs of all the evaluated approaches for a single
movie snippet. Examining more qualitative examples, which
we provide on our web page, indicates that it is possible to
learn relevant information from this corpus.

7. Conclusions 

In this work we presented a novel dataset of movies with
aligned descriptions sourced from movie scripts and DVS
(audio descriptions for the blind). We present first
experiments on this dataset using state-of-the-art visual features,
combined with a recent movie description approach from
[56]. We adapt the approach for this dataset to work
without annotations, relying instead on labels obtained by
semantic parsing. We show competitive performance on the
TACoS Multi-Level dataset and promising results on our
movie description data. We compare DVS with previously
used script data and find that DVS tends to be more correct
and relevant to the movie than script sentences. Beyond our
first study on single sentences, the dataset opens new
possibilities to understand stories and plots across multiple
sentences in an open domain scenario on a large scale,
something no other video or image description dataset can
offer as of now.
Nearest neighbor
  DT                 People stand with a happy group, including someone.
  LSDA               The hovering Dementors chase the group into the lift.
  HYBRID             Close by, a burly fair-haired someone in an orange jumpsuit runs down a dark street.
  PLACES             Someone is on his way to look down the passage way between the houses.

SMT Visual words     Someone in the middle of the car pulls up ahead

SMT with our text-labels
  DT 30              Someone opens the door to someone
  DT 100             Someone, the someone, and someone enters the room
  All 100            Someone opens the door and shuts the door, someone and his someone

SMT with our sense-labels
  DT 30              Someone, the someone, and someone enters the room
  DT 100             Someone goes over to the door
  All 100            Someone enters the room

Movie script/DVS     Someone follows someone into the leaky cauldron

Figure 4: Qualitative comparison of different video description methods. Discussion in Section 6.3. More examples on our
web page.